Add Isaac-0.2-2B-Preview VLM contrib model #154

Open

jimburtoft wants to merge 4 commits into aws-neuron:main from jimburtoft:contrib/isaac-0.2-2b

Conversation

jimburtoft (Contributor) commented May 1, 2026

Note: The template below includes items meant for model contributions only.

Description

Isaac-0.2-2B-Preview is a 2.57B-parameter vision-language model from PerceptronAI, combining a standard Qwen3 text backbone with a SigLIP2 vision encoder and a 2-layer MLP projector with pixel shuffle. It is onboarded to Neuron via NxDI's NeuronBaseForImageToText framework.

Validated on trn2.3xlarge (LNC=2, TP=1, BF16): text-only first-token cosine similarity of 0.999978 vs. the CPU reference, with 110.7 tok/s text-only and 108.7 tok/s image+text generation.
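For readers unfamiliar with the projector pattern, below is a minimal PyTorch sketch of pixel shuffle followed by a 2-layer MLP. The dimensions, ratio, and module names are illustrative assumptions, not the contrib implementation.

```python
import torch
import torch.nn as nn

class PixelShuffleProjector(nn.Module):
    """Illustrative sketch only: pixel shuffle folds each r x r patch
    neighborhood into channels (cutting the token count by r^2), then a
    2-layer MLP maps the merged features into the text hidden size.
    vision_dim/text_dim/shuffle_ratio are assumed values."""
    def __init__(self, vision_dim=1152, text_dim=2048, shuffle_ratio=2):
        super().__init__()
        self.r = shuffle_ratio
        merged = vision_dim * shuffle_ratio ** 2
        self.mlp = nn.Sequential(
            nn.Linear(merged, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, x):            # x: [batch, H*W, vision_dim]
        b, n, c = x.shape
        h = w = int(n ** 0.5)        # assume a square patch grid
        x = x.view(b, h, w, c)
        # Split h and w into (h//r, r) and (w//r, r), then fold both r-axes
        # into the channel dim: token count shrinks by r^2.
        x = x.view(b, h // self.r, self.r, w // self.r, self.r, c)
        x = x.permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(b, (h // self.r) * (w // self.r), c * self.r ** 2)
        return self.mlp(x)           # [batch, n / r^2, text_dim]
```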

Model Information

Model Name: Isaac-0.2-2B-Preview

Model Architecture: Vision-language model (SigLIP2 encoder + pixel shuffle + 2-layer MLP projector + Qwen3 decoder)

HuggingFace: PerceptronAI/Isaac-0.2-2B-Preview

License: CC-BY-NC-4.0

Checklist

Required Components

  • Accuracy Test (test/integration/validate_text_logits.py)

    • Compares Neuron BF16 first-token logits against CPU FP32 reference across 5 text prompts
    • Average cosine similarity: 0.999978
    • Top-1 match: 5/5, Top-5 overlap: 5.0/5, Top-10 overlap: 9.8/10
    • Additional accuracy tests: validate_image_text.py (3 image+text E2E tests), validate_vision_encoder.py, validate_tkg.py (a sketch of the logit comparison appears after this list)
  • README.md with the following sections:

    • Usage Example: Compile and run examples for text-only and image+text inference
    • Compatibility Matrix: Validated on trn2.3xlarge (LNC=2, TP=1/2/4) with SDK 2.29
    • Example Checkpoints: PerceptronAI/Isaac-0.2-2B-Preview
    • Testing Instructions: How to run each test script
    • Benchmark Results: Performance numbers for text-only and image+text
    • Known Limitations: BS>1, NKI kernel constraints, vLLM image+text
  • Source Code (src/isaac_neuron/)

    • modeling_isaac.py: Top-level VLM orchestrator (NeuronBaseForImageToText)
    • modeling_isaac_text.py: Text backbone (NeuronBaseModel wrapping NxDI Qwen3 layers)
    • modeling_isaac_vision.py: Vision encoder wrapper (SigLIP2 + pixel shuffle + MLP projector)
    • siglip/modeling_siglip.py: SigLIP2 encoder (adapted from Gemma3-vision contrib)
    • siglip/layers.py: Parallel Conv2d for vision patch embedding
    • ndxi_patch.py: SDK 2.29 compatibility patches
    • utils.py: Shared utilities
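
For reference, here is a minimal sketch of the kind of comparison the accuracy test performs: cosine similarity and top-k overlap between Neuron and CPU first-token logits. The function and argument names are placeholders, assuming two precomputed logit vectors rather than the actual test script.

```python
import torch

def compare_first_token_logits(neuron_logits: torch.Tensor,
                               cpu_logits: torch.Tensor, k: int = 5):
    """Sketch of the metrics described above, not the contrib test itself.
    Inputs are assumed to be 1-D [vocab_size] first-token logit tensors from
    the Neuron (BF16) and CPU (FP32) runs."""
    cos = torch.nn.functional.cosine_similarity(
        neuron_logits.float().flatten(), cpu_logits.float().flatten(), dim=0)
    # Top-k overlap: how many of the k highest-scoring token IDs agree.
    top_neuron = set(neuron_logits.float().topk(k).indices.tolist())
    top_cpu = set(cpu_logits.float().topk(k).indices.tolist())
    overlap = len(top_neuron & top_cpu)
    top1_match = neuron_logits.argmax().item() == cpu_logits.argmax().item()
    return cos.item(), overlap, top1_match
```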

Optional Components

  • Integration Tests (test/integration/)

    • validate_text_logits.py: First-token logit accuracy (CPU vs Neuron)
    • validate_tkg.py: Token generation quality and throughput
    • validate_image_text.py: End-to-end multimodal generation
    • validate_vision_encoder.py: Vision encoder output validation
    • test_tp.py: Tensor parallelism at TP=1, 2, 4
    • test_kernels.py: NKI kernel compatibility sweep
    • test_scaling.py: Sequence length scaling (1024-8192)
    • test_weight_loading.py: State dict key mapping validation
    • benchmark.py: Formal benchmark harness (10 iterations, 3 warmup)
    • run_isaac.py: Quick compile + run utility
  • vLLM Integration (vllm/)

    • patch_vllm_isaac.py: Automated 3-file vllm-neuron patch script
    • run_offline_inference.py: Offline inference example
    • run_online_inference.py: OpenAI-compatible API client
    • start-vllm-server.sh: Server launch script
    • README.md: Setup and usage documentation
    • Status: Text-only serving works (~78 tok/s). Image+text has a known pixel_values format mismatch. (An offline-inference sketch appears after this list.)
  • GPU Benchmark (gpu_benchmark/)

    • benchmark_gpu.py: L40S benchmark script (vLLM 0.20.0, CUDA graphs enabled)
    • gpu_benchmark_results.json: Full results (4 workloads)
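
For orientation, the offline-inference path mentioned above typically follows the standard vLLM API shown below. The model path, prompt, and sampling settings are placeholders; the actual example is vllm/run_offline_inference.py, which also applies the Neuron patches first.

```python
from vllm import LLM, SamplingParams

# Placeholder model path; the contrib script patches vllm-neuron before this.
llm = LLM(model="PerceptronAI/Isaac-0.2-2B-Preview")
params = SamplingParams(temperature=0.0, max_tokens=128)

# Text-only serving is the verified path (~78 tok/s per the status above).
outputs = llm.generate(["Describe what a vision-language model does."], params)
for out in outputs:
    print(out.outputs[0].text)
```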

Folder Structure

/contrib/models/Isaac-0.2-2B/
  README.md
  /src/isaac_neuron/
    __init__.py
    modeling_isaac.py
    modeling_isaac_text.py
    modeling_isaac_vision.py
    ndxi_patch.py
    utils.py
    /siglip/
      __init__.py
      modeling_siglip.py
      layers.py
  /test/
    __init__.py
    /integration/
      __init__.py
      benchmark.py
      run_isaac.py
      test_kernels.py
      test_scaling.py
      test_tp.py
      test_weight_loading.py
      validate_image_text.py
      validate_text_logits.py
      validate_tkg.py
      validate_vision_encoder.py
  /vllm/
    README.md
    add_execute_model.py
    patch_vllm_isaac.py
    run_offline_inference.py
    run_online_inference.py
    start-vllm-server.sh
  /gpu_benchmark/
    benchmark_gpu.py
    gpu_benchmark_results.json
    nuke_perceptron_import.py
    patch_gpu_modular.py
    setup_gpu.sh
    fix_indent.py

Testing

How did you test this change?

All tests run on trn2.3xlarge (LNC=2, TP=1) with Neuron SDK 2.29 (DLAMI 20260410, NxDI 0.9.17334).

  1. Accuracy: First-token logits compared against CPU FP32 reference — avg cosine 0.999978 across 5 prompts
  2. Text generation: 5 text-only prompts generate coherent output at 94-111 tok/s
  3. Image+text: 3 multimodal prompts generate correct image descriptions at 104-108 tok/s
  4. Tensor parallelism: TP=1, 2, 4 all compile and pass accuracy gates (cosine 0.9999+)
  5. Sequence scaling: seq_len 1024-8192 all compile and run correctly
  6. NKI kernels: CTE flash attention works; MLP/QKV kernels documented as incompatible at TP=1
  7. vLLM: Text-only serving verified via offline inference (~78 tok/s)
  8. GPU comparison: L40S benchmark via vLLM 0.20.0 with CUDA graphs
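
The throughput and TPOT figures below come from the 10-iteration, 3-warmup harness in test/integration/benchmark.py. A generic sketch of that measurement pattern, with generate_fn standing in for the actual NxDI generation call:

```python
import time

def measure_decode_throughput(generate_fn, n_iters=10, n_warmup=3, new_tokens=256):
    """Generic warmup-then-measure loop matching the 10-iteration / 3-warmup
    setup described above; generate_fn is a placeholder callable that decodes
    new_tokens tokens. The real harness lives in test/integration/benchmark.py."""
    for _ in range(n_warmup):             # warmup absorbs compile/cache effects
        generate_fn(new_tokens)
    times = []
    for _ in range(n_iters):
        start = time.perf_counter()
        generate_fn(new_tokens)
        times.append(time.perf_counter() - start)
    avg = sum(times) / len(times)
    tok_per_s = new_tokens / avg
    tpot_ms = avg / new_tokens * 1000.0
    return tok_per_s, tpot_ms
```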

Benchmark Results (trn2.3xlarge, TP=1, BF16, seq_len=1024, 10 iterations):

Mode             Throughput    TPOT
Text-only        110.7 tok/s   9.0ms
Image+text       108.7 tok/s   9.2ms
Projected DP=4   ~443 tok/s    -
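As a sanity check on the table: TPOT is roughly the reciprocal of decode throughput, and the DP=4 row is a linear projection from the TP=1 measurement.

```python
tok_s = 110.7              # measured text-only throughput at TP=1 (from the table)
print(1000.0 / tok_s)      # ~9.03 ms, matching the 9.0ms TPOT row
print(4 * tok_s)           # ~442.8 tok/s, the "~443 tok/s" DP=4 projection
```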

GPU Comparison (L40S, BF16, vLLM 0.20.0, CUDA graphs enabled):

Metric               L40S GPU    trn2 Neuron (TP=1)   trn2 Neuron (DP=4)
TPOT (short input)   5.75ms      9.0ms                -
Throughput (short)   174 tok/s   111 tok/s            ~443 tok/s
TPOT (long input)    6.09ms      9.0ms                -
Throughput (long)    164 tok/s   111 tok/s            ~443 tok/s

The L40S GPU is about 1.5x faster per core than a single NeuronCore. At the device level (DP=4), trn2.3xlarge is about 2.5x faster than the L40S.

NxDI implementation of PerceptronAI/Isaac-0.2-2B-Preview VLM (from the commit messages):

  • Qwen3 text backbone with SigLIP2 vision encoder
  • 2-layer MLP projector with pixel shuffle (64 vision tokens/image)
  • Supports TP=1/2/4, seq_len up to 8192
  • 110.7 tok/s text-only, 108.7 tok/s image+text on trn2.3xlarge
  • 9.0ms TPOT at seq_len=1024
  • BF16, CTE flash attention enabled
  • Validated: cosine 0.9999+ vs CPU reference across all configs
  • vLLM-neuron integration with 3-file patch (text-only working, ~78 tok/s)
  • GPU comparative benchmark: L40S at 52 tok/s vs trn2 at 111 tok/s (2.13x; superseded by the corrected numbers below)
  • modular_isaac.py perceptron import fix (nuke_perceptron_import.py)
  • execute_model override for logits-to-token-ID conversion
  • Known limitation: image+text via vLLM not yet supported (pixel_values format mismatch)

Benchmark correction: the previous GPU benchmark used enforce_eager=True, which handicapped the L40S to 52 tok/s. With CUDA graphs, torch.compile, and FlashAttention v2 enabled, the L40S achieves 174 tok/s: about 1.5x faster per core than a single NeuronCore, while trn2 at DP=4 remains about 2.5x faster at the device level.
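
The eager-mode handicap described above corresponds to vLLM's enforce_eager flag; leaving it at its default (False) re-enables CUDA graph capture. A sketch of the difference, with the model path as a placeholder (the two constructions are shown side by side for illustration, not to run in one process):

```python
from vllm import LLM

# enforce_eager=True skips CUDA graph capture (the earlier 52 tok/s run).
llm_eager = LLM(model="PerceptronAI/Isaac-0.2-2B-Preview", enforce_eager=True)

# Default (False) lets vLLM capture CUDA graphs, as in the 174 tok/s rerun.
llm_graphs = LLM(model="PerceptronAI/Isaac-0.2-2B-Preview", enforce_eager=False)
```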